On Retrieving Legal Files: Shortening Documents and Weeding Out Garbage
Authors
Abstract
This paper describes our participation in the TREC Legal experiments in 2007. We applied novel normalization techniques designed to slightly favor longer documents rather than assuming that all documents should have equal weight. We also developed a new method for reformulating query text when background information is provided with an information request, and we experimented with enhanced OCR error detection to reduce the size of the term list and remove noise in the data. In this article, we discuss the impact of these techniques on the TREC 2007 data sets. We show that simple normalization methods significantly outperform cosine normalization in the legal domain.
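The abstract does not say which normalization the authors used, so the following is only a minimal sketch of the general idea. It compares cosine normalization with a hypothetical "simple" alternative that divides by one plus the natural log of the document length. The function names, the log-length formula, and the toy documents are all assumptions made for illustration, not the paper's method.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercase, whitespace-tokenize, and count terms."""
    return Counter(text.lower().split())


def cosine_norm(counts):
    """Euclidean (L2) length of the raw term-count vector."""
    return math.sqrt(sum(c * c for c in counts.values()))


def log_length_norm(counts):
    """Hypothetical 'simple' normalizer: 1 + ln(total token count).

    Assumption for illustration only; it grows more slowly with
    document length than the cosine normalizer does.
    """
    return 1.0 + math.log(sum(counts.values()))


def score(query, doc, norm):
    """Dot product of query and document counts, divided by norm(doc)."""
    q = term_counts(query)
    d = term_counts(doc)
    if not d:
        return 0.0
    dot = sum(q[t] * d[t] for t in q)
    return dot / norm(d)


query = "contract breach"
short_doc = "contract breach damages"
long_doc = ("contract breach damages awarded after the court "
            "reviewed the lease and the notice")

# How much of the short document's score the long document retains
# under each normalization (both documents match the query equally).
cos_ratio = (score(query, long_doc, cosine_norm)
             / score(query, short_doc, cosine_norm))
log_ratio = (score(query, long_doc, log_length_norm)
             / score(query, short_doc, log_length_norm))

print(f"cosine: long/short score ratio = {cos_ratio:.3f}")
print(f"log-length: long/short score ratio = {log_ratio:.3f}")
```

Both documents contain each query term once. In this toy example the long document keeps a larger share of the short document's score under the log-length normalizer than under cosine normalization, which is the sense in which a normalization can "slightly favor longer documents".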
Similar resources
Improving Search and Retrieval Performance through Shortening Documents, Detecting Garbage, and Throwing Out Jargon
This thesis describes the development of a new search and retrieval system used to index and process queries for several different data sets of documents. This thesis also describes my work with the TREC Legal data set, in particular, the new algorithms I designed to improve recall and precision rates in the legal domain. I have applied novel normalization techniques that are designed to slight...
Investigating Legal Loopholes in the Field of Official Documents in Iran and its Ethical Implications
Background: In the Law on registration of deeds and real estate, the definition of official document and the scope of inclusion of official documents are different from civil law, and these definitions create different interpretations and effects in society and how to deal with legal issues and problems. Resolving legal deficiencies in answering accidental questions that occur in the community,...
Legal Documents Clustering using Latent Dirichlet Allocation
At present, the availability of a large number of legal judgments in digital form creates opportunities and challenges for both the legal community and information technology researchers. This development needs assistance in organizing, analyzing, retrieving and presenting this content in a helpful and distributed manner. We propose an approach to cluster legal judgments based on th...
Belgisch Staatsblad Corpus: Retrieving French-Dutch Sentences from Official Documents
We describe the compilation of a large corpus of French-Dutch sentence pairs from official Belgian documents which are available in the online version of the publication Belgisch Staatsblad/Moniteur belge, and which have been published between 1997 and 2006. After downloading files in batch, we filtered out documents which have no translation in the other language, documents which contain sever...
Categorisation by Context
Assistance in retrieving documents on the World Wide Web is provided either by search engines, through keyword-based queries, or by catalogues, which organise documents into hierarchical collections. Maintaining catalogues manually is becoming increasingly difficult due to the sheer amount of material, and therefore it will be necessary to resort to techniques for automatic classification of...